Loan Approval Status
Abstract
Leveraging a Kaggle dataset comprising 4269 records and 13 columns, our project endeavors to forecast individual loan approval status. The dataset incorporates applicant details like education, self-employment status, annual income, CIBIL score, and asset values. Our in-depth analysis, documented on GitHub using R Markdown, employs diverse visualizations, statistical methods, and modeling techniques. The resulting insights have the potential to assist both lenders and borrowers by addressing requirements and reducing rejection rates.
Introduction
Our project, driven by personal experiences as international students, explores the financial challenges of pursuing academic dreams abroad. Having navigated the complexities of loan applications, particularly for education, we recognize broader implications across various markets. The global Personal Loan market, valued at $47.79 billion in 2020, is projected to reach $719.31 billion by 2030. The Global Student Loan market, at $3.93 trillion in 2021, is expected to grow to $8.75 trillion by 2031 (8.7% CAGR). The Global Automotive Finance market, valued at $259.84 billion in 2022, foresees a steady 7.3% CAGR from 2023 to 2030. Simultaneously, the Global Home Loan market, at $4.52 trillion in 2021, is set to soar to $33.3 trillion by 2031 (22.3% CAGR). Additionally, the global FinTech lending market, valued at $449.89 billion in 2020, is projected to reach $4,957.16 billion by 2030. Through our project, we aim to provide valuable insights into loan dynamics, potentially enhancing application efficiency and success rates globally.
Literature Survey
In this literature survey, we review existing research on loan approval prediction, covering traditional methods, machine learning models, and recent trends, to inform our study and identify gaps and challenges. Sheikh et al. [1] performed a machine-learning-based analysis of loan approval using several models, including Support Vector Machine and Logistic Regression. Ndayisenga [2] applied advanced algorithms such as Gradient Boosting and Random Forest and concluded that credit score strongly influences the likelihood of loan approval. Additionally, Murthy et al. [3] analyzed the probability of loan approval using KNN and Decision Tree and built a dedicated portal for quick decision making.
Data and Methodology
The project “Loan Approval Dataset” utilizes a comprehensive dataset sourced from Kaggle, featuring records of 4269 applicants. This dataset includes attributes such as Education, No. of Dependents, Self-Employment, Annual Income, Value of Assets, CIBIL Score, and Loan status.
The goal of this project is to analyze applicant records and identify the factors contributing to loan approval.
Data Preparation
Data Collection: The dataset, sourced from the reputable data-sharing platform Kaggle, provides a robust repository of loan applicant records.
Data Cleaning: We performed data cleansing, addressing null values, removing irrelevant columns, validating data types, and ensuring overall dataset consistency for enhanced accuracy and reliability.
Methodological Approach
Descriptive Statistics: The dataset was first explored by computing summary statistics, shedding light on the educational status, employment details, CIBIL score, and asset values of applicants. Loan status takes two categories: Approved and Rejected.
Visualization Techniques: Diverse visualization functions were employed to craft informative charts and graphs, facilitating the effective presentation of findings and enhancing the comprehension of complex patterns.
Hypothesis Testing: Various statistical tests (Welch two-sample t-tests and a chi-squared test) were conducted to validate hypotheses.
Correlation Analysis: Correlation techniques were applied to examine the relationships between variables concerning loan status.
Variable Factorization: Qualitative variables were factored using the as.factor() function for analysis optimization.
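The factoring step can be sketched as follows. This is a minimal illustration on a toy data frame, not the project's actual code: the data frame name `loans` and the three sample rows are assumptions, and `trimws()` is added because the raw values shown in the structure output carry a leading space.

```r
# Toy data frame mimicking the raw character columns (values are illustrative)
loans <- data.frame(
  education     = c(" Graduate", " Not Graduate", " Graduate"),
  self_employed = c(" No", " Yes", " No"),
  loan_status   = c(" Approved", " Rejected", " Rejected"),
  stringsAsFactors = FALSE
)

qualitative <- c("education", "self_employed", "loan_status")

# Trim the leading spaces, then convert each qualitative column to a factor
loans[qualitative] <- lapply(
  loans[qualitative],
  function(x) as.factor(trimws(x))
)

str(loans)
levels(loans$loan_status)
```

Factor levels are ordered alphabetically by default, so Approved becomes the first level of `loan_status`; this ordering matters later when a model treats the first level as the reference.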
Data preprocessing
Importing the Data
Structure of the data
## 'data.frame': 4269 obs. of 13 variables:
## $ loan_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ no_of_dependents : int 2 0 3 3 5 0 5 2 0 5 ...
## $ education : chr " Graduate" " Not Graduate" " Graduate" " Graduate" ...
## $ self_employed : chr " No" " Yes" " No" " No" ...
## $ income_annum : int 9600000 4100000 9100000 8200000 9800000 4800000 8700000 5700000 800000 1100000 ...
## $ loan_amount : int 29900000 12200000 29700000 30700000 24200000 13500000 33000000 15000000 2200000 4300000 ...
## $ loan_term : int 12 8 20 8 20 10 4 20 20 10 ...
## $ cibil_score : int 778 417 506 467 382 319 678 382 782 388 ...
## $ residential_assets_value: int 2400000 2700000 7100000 18200000 12400000 6800000 22500000 13200000 1300000 3200000 ...
## $ commercial_assets_value : int 17600000 2200000 4500000 3300000 8200000 8300000 14800000 5700000 800000 1400000 ...
## $ luxury_assets_value : int 22700000 8800000 33300000 23300000 29400000 13700000 29200000 11800000 2800000 3300000 ...
## $ bank_asset_value : int 8000000 3300000 12800000 7900000 5000000 5100000 4300000 6000000 600000 1600000 ...
## $ loan_status : chr " Approved" " Rejected" " Rejected" " Rejected" ...
Description of the variables:
- loan_id: Loan Application Id
- no_of_dependents: Applicant Dependents
- education: Graduate/Not Graduate
- self_employed: Yes/No
- income_annum: Annual Income
- loan_amount: Loan Value
- loan_term: Loan Term in Years
- cibil_score: Credit Score
- residential_assets_value: Value of Residential Assets
- commercial_assets_value: Value of Commercial Assets
- luxury_assets_value: Value of Luxury Assets
- bank_asset_value: Value of Bank Assets
- loan_status: Approved / Rejected
These variables are further classified as:
Qualitative: education, self_employed, loan_status
Quantitative: no_of_dependents, income_annum, loan_amount, loan_term, cibil_score, residential_assets_value, commercial_assets_value, luxury_assets_value, bank_asset_value
We excluded the loan_id variable using the
subset() function, considering its minimal contribution to
data analysis.
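Dropping the identifier column with `subset()` might look like the sketch below. The toy data frame and the name `loans_no_id` are assumptions used only for illustration.

```r
# Toy data frame with an identifier column (values are illustrative)
loans <- data.frame(
  loan_id          = 1:3,
  no_of_dependents = c(2L, 0L, 3L),
  cibil_score      = c(778L, 417L, 506L)
)

# A negative selection drops loan_id and keeps every other column
loans_no_id <- subset(loans, select = -loan_id)

names(loans_no_id)
```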
Summary of the data
## no_of_dependents education self_employed income_annum
## Min. :0.0 Graduate :2144 No :2119 Min. : 200000
## 1st Qu.:1.0 Not Graduate:2125 Yes:2150 1st Qu.:2700000
## Median :3.0 Median :5100000
## Mean :2.5 Mean :5059124
## 3rd Qu.:4.0 3rd Qu.:7500000
## Max. :5.0 Max. :9900000
## loan_amount loan_term cibil_score residential_assets_value
## Min. : 300000 Min. : 2.0 Min. :300 Min. : -100000
## 1st Qu.: 7700000 1st Qu.: 6.0 1st Qu.:453 1st Qu.: 2200000
## Median :14500000 Median :10.0 Median :600 Median : 5600000
## Mean :15133450 Mean :10.9 Mean :600 Mean : 7472617
## 3rd Qu.:21500000 3rd Qu.:16.0 3rd Qu.:748 3rd Qu.:11300000
## Max. :39500000 Max. :20.0 Max. :900 Max. :29100000
## commercial_assets_value luxury_assets_value bank_asset_value
## Min. : 0 Min. : 300000 Min. : 0
## 1st Qu.: 1300000 1st Qu.: 7500000 1st Qu.: 2300000
## Median : 3700000 Median :14600000 Median : 4600000
## Mean : 4973155 Mean :15126306 Mean : 4976692
## 3rd Qu.: 7600000 3rd Qu.:21700000 3rd Qu.: 7100000
## Max. :19400000 Max. :39200000 Max. :14700000
## loan_status
## Approved:2656
## Rejected:1613
##
##
##
##
The summary() function provides a statistical summary of
the entire dataset.
Visualizations
Data Pre-processing
We removed unnecessary whitespace from the column names.
Box plot of Number of Dependents and Loan Status
The box plot comparing the number of dependents and loan status reveals minimal variation. Both approved and rejected statuses exhibit similar distributions, with consistent median values. This suggests that the number of dependents has no significant impact on loan approval status.
Density plot of Loan Term based on Loan Status
The density plot highlights a clear relationship between loan approval/rejection and the loan term. Notably, applications with a term of 0-5 years show the highest approval rate, while those exceeding 5 years face more rejections. This pattern indicates a lending strategy favoring individuals capable of immediate repayment.
Scatter Plot of CIBIL Score vs Loan Amount
The scatter plot reveals a distinct relationship between CIBIL score and loan status that holds across loan amounts. Rejections dominate the CIBIL score range of 300-550, while approvals rise sharply beyond a CIBIL score of 550, even for loan amounts exceeding 35M.
Stacked Bar between Loan Status and Self Employment
The bar plot shows minimal differentiation in loan approval rates between self-employed and non-self-employed individuals, suggesting that self-employment may not be a decisive factor influencing loan approval.
Scatter Plot of Loan Amount vs Commercial assets value based on Loan Status
The scatter plot depicts a positive correlation between commercial asset value and loan amount, implying larger loans align with higher asset values. Similar distributions for approvals and rejections yield comparable approval and rejection rates for both commercial asset value and loan amount.
Scatter Plot of Loan Amount vs Residential Assets Value based on Loan Status
The graph indicates a positive correlation between residential asset value and loan amount, implying that higher residential values correspond to increased loan amounts. Similar distributions for approvals and rejections result in equal approval and rejection rates.
Scatter Plot of Loan Amount vs Luxury assets value based on Loan Status
The scatter plot reveals a positive correlation between luxury asset value and loan amount, suggesting that an increase in luxury asset value corresponds to higher loan amounts, with elevated chances of approval.
Scatter Plot of Loan Amount vs Bank Assets Value
Based on the graph, there is a positive correlation between bank assets value and loan amount, indicating that as bank assets value increases, so does the loan amount.
Scatter Plot of Loan Amount vs Income Per Annum based on Loan Status
The scatter plot indicates a direct correlation between annual income and loan amount, implying higher income aligns with larger loans. Approval and rejection rates appear consistent across different income and loan amount levels.
Box plot of CIBIL Score vs Self Employed based on Loan Status
The box plot analysis indicates that CIBIL score significantly influences loan status, whereas self-employment status does not exhibit a notable impact.
Density Plot of Bank Assets grouped by Loan Status
The density plot shows nearly identical distributions of bank assets for approved and rejected applications, suggesting that loan approval does not depend on fluctuations in bank assets.
Density Plot of CIBIL Score grouped by Loan Status
The density plot highlights that applicants with a higher CIBIL score (typically 500+) have a greater likelihood of loan approval, while applications with a score below that are rejected. This underscores the crucial role of a good CIBIL score and its significant impact on loan applications.
Correlation Plot
The correlation plot underscores the crucial link between
cibil_score and loan_status, highlighting a
strong correlation. Additionally, loan_amount shows
noteworthy connections with various asset values, underscoring the
importance of creditworthiness and financial factors in loan
approval.
Statistical Tests
T-Test on Loan Status (Approval/Rejection) and Cibil Score
##
## Welch Two Sample t-test
##
## data: Approved$cibil_score and rejected$cibil_score
## t = 88, df = 4263, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 268 280
## sample estimates:
## mean of x mean of y
## 703 429
Null Hypothesis (\(H_{0}\)): CIBIL score has no significant association with loan status.
Alternate Hypothesis (\(H_{A}\)): CIBIL score has significant association with loan status.
The p-value (\(< 2 \times 10^{-16}\)) is far below the standard significance level of 0.05. Hence, we reject the null hypothesis and conclude that CIBIL score has a significant association with loan status: the mean score is 703 for approved applications versus 429 for rejected ones.
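A Welch two-sample t-test of this kind can be sketched as below. The scores are simulated, not the project's data; the group means, standard deviations, and sample sizes are assumptions chosen only to make the example self-contained.

```r
set.seed(1)

# Simulated CIBIL scores for two groups (illustrative values only)
approved_scores <- rnorm(200, mean = 700, sd = 100)
rejected_scores <- rnorm(150, mean = 430, sd = 80)

# t.test() defaults to var.equal = FALSE, which gives the Welch test
res <- t.test(approved_scores, rejected_scores)

res$method     # name of the test that was run
res$statistic  # t statistic
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the difference in means
```

The Welch variant does not assume equal variances in the two groups, which is why its degrees of freedom are generally not an integer.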
Chi-squared test between Education and Loan Status
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: c
## X-squared = 0.08, df = 1, p-value = 0.8
Null Hypothesis (\(H_{0}\)): Education level and loan status are independent of each other.
Alternate Hypothesis (\(H_{A}\)): Education level and loan status are dependent on each other.
With a high p-value of \(0.772\), we fail to reject the null hypothesis. Consequently, we find no evidence that an applicant's education level has a significant impact on loan approval.
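The chi-squared test of independence operates on a contingency table of counts. The sketch below uses made-up counts (an assumption, not the project's table) to show the mechanics.

```r
# Illustrative 2x2 contingency table: rows are education, columns are loan status
counts <- matrix(
  c(60, 58,   # Approved column: Graduate, Not Graduate
    40, 42),  # Rejected column: Graduate, Not Graduate
  nrow = 2,
  dimnames = list(
    education   = c("Graduate", "Not Graduate"),
    loan_status = c("Approved", "Rejected")
  )
)

# For 2x2 tables chisq.test() applies Yates' continuity correction by default
res <- chisq.test(counts)

res$statistic  # X-squared
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value
```

On real data the table would typically be built with `table()` from the two factor columns rather than typed by hand.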
T-Test between Loan Status and Bank Asset Value
##
## Welch Two Sample t-test
##
## data: Approved$bank_asset_value and rejected$bank_asset_value
## t = -0.4, df = 3453, p-value = 0.7
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -245677 154809
## sample estimates:
## mean of x mean of y
## 4959526 5004960
Null Hypothesis (\(H_{0}\)): Bank asset value and loan status are independent of each other.
Alternative Hypothesis (\(H_{A}\)): Bank asset value and loan status are dependent on each other.
Bank asset value and loan status have a high p-value of \(0.656\). Thus, we cannot reject the null hypothesis: the data provide no evidence of a significant association between bank asset value and loan status.
T-Test between Loan Status and Residential Assets Value
##
## Welch Two Sample t-test
##
## data: Approved$residential_assets_value and rejected$residential_assets_value
## t = -0.9, df = 3400, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -595310 209937
## sample estimates:
## mean of x mean of y
## 7399812 7592498
Null Hypothesis (\(H_{0}\)): There is no significant association between the values of residential asset and loan approval status.
Alternative Hypothesis (\(H_{A}\)): There is a significant association between the values of residential asset and loan approval status.
With a p-value of \(0.348\), we cannot reject the null hypothesis. We therefore find no significant association between residential assets value and loan status.
T-test between number of dependents and loan status
##
## Welch Two Sample t-test
##
## data: Approved$no_of_dependents and rejected$no_of_dependents
## t = -1, df = 3400, p-value = 0.2
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1683 0.0416
## sample estimates:
## mean of x mean of y
## 2.47 2.54
Null Hypothesis (\(H_{0}\)): There is no significant association between the number of dependents and loan approval status.
Alternative Hypothesis (\(H_{A}\)): There is a significant association between the number of dependents and loan approval status.
With a p-value of \(0.237\), we fail to reject the null
hypothesis. Therefore, we can conclude that there is no significant
association between the number of dependents and loan status.
T-test between luxury assets value and loan status
##
## Welch Two Sample t-test
##
## data: a$luxury_assets_value and r$luxury_assets_value
## t = -1, df = 3442, p-value = 0.3
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -851752 271073
## sample estimates:
## mean of x mean of y
## 15016604 15306944
Null Hypothesis (\(H_{0}\)): The luxury assets value of an applicant has no significant association with their loan status.
Alternate Hypothesis (\(H_{A}\)): The luxury assets value of an applicant has significant association with their loan status.
With a p-value of \(0.311\), we cannot reject the null
hypothesis and thus, we conclude that the luxury assets value of an
applicant has no significant association with their loan status.
## Pearson Correlation Coefficient: 0.00844
## p-value: 0.582
The low Pearson correlation coefficient \(0.008\) indicates a weak relationship
between loan_term and loan_amount. Moreover,
the high p-value \(0.582\)
suggests the observed correlation is not statistically significant.
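A Pearson correlation test such as the one reported above can be run with `cor.test()`. The sketch below uses simulated, independent variables (an assumption) purely to show which fields hold the coefficient and the p-value.

```r
set.seed(42)

# Simulated, independently drawn variables (illustrative only)
loan_term   <- sample(seq(2, 20, by = 2), 500, replace = TRUE)
loan_amount <- runif(500, min = 3e5, max = 3.95e7)

res <- cor.test(loan_term, loan_amount, method = "pearson")

res$estimate  # Pearson correlation coefficient
res$p.value   # p-value for H0: true correlation is 0
```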
Model Selection
Regression problem
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(loan_amount ~ ., data = data, nvmax = 10,
## nbest = 2, method = "exhaustive")
## 12 Variables (and intercept)
## Forced in Forced out
## no_of_dependents FALSE FALSE
## educationNot Graduate FALSE FALSE
## self_employedNo FALSE FALSE
## income_annum FALSE FALSE
## loan_term FALSE FALSE
## cibil_score FALSE FALSE
## residential_assets_value FALSE FALSE
## commercial_assets_value FALSE FALSE
## luxury_assets_value FALSE FALSE
## bank_asset_value FALSE FALSE
## loan_statusRejected FALSE FALSE
## self_employedOther FALSE FALSE
## 2 subsets of each size up to 11
## Selection Algorithm: exhaustive
## no_of_dependents educationNot Graduate self_employedNo
## 1 ( 1 ) " " " " " "
## 1 ( 2 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 2 ( 2 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 3 ( 2 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 4 ( 2 ) "*" " " " "
## 5 ( 1 ) "*" " " " "
## 5 ( 2 ) " " " " " "
## 6 ( 1 ) "*" " " " "
## 6 ( 2 ) "*" " " " "
## 7 ( 1 ) "*" " " " "
## 7 ( 2 ) "*" " " " "
## 8 ( 1 ) "*" " " " "
## 8 ( 2 ) "*" " " " "
## 9 ( 1 ) "*" " " " "
## 9 ( 2 ) "*" " " "*"
## 10 ( 1 ) "*" " " "*"
## 10 ( 2 ) "*" "*" " "
## 11 ( 1 ) "*" "*" "*"
## 11 ( 2 ) "*" " " "*"
## self_employedOther income_annum loan_term cibil_score
## 1 ( 1 ) " " "*" " " " "
## 1 ( 2 ) " " " " " " " "
## 2 ( 1 ) " " "*" " " " "
## 2 ( 2 ) " " "*" " " " "
## 3 ( 1 ) " " "*" " " "*"
## 3 ( 2 ) " " "*" " " " "
## 4 ( 1 ) " " "*" " " "*"
## 4 ( 2 ) " " "*" " " "*"
## 5 ( 1 ) " " "*" " " "*"
## 5 ( 2 ) " " "*" "*" "*"
## 6 ( 1 ) " " "*" "*" "*"
## 6 ( 2 ) " " "*" " " "*"
## 7 ( 1 ) " " "*" "*" "*"
## 7 ( 2 ) " " "*" "*" "*"
## 8 ( 1 ) " " "*" "*" "*"
## 8 ( 2 ) " " "*" "*" "*"
## 9 ( 1 ) " " "*" "*" "*"
## 9 ( 2 ) " " "*" "*" "*"
## 10 ( 1 ) " " "*" "*" "*"
## 10 ( 2 ) " " "*" "*" "*"
## 11 ( 1 ) " " "*" "*" "*"
## 11 ( 2 ) "*" "*" "*" "*"
## residential_assets_value commercial_assets_value luxury_assets_value
## 1 ( 1 ) " " " " " "
## 1 ( 2 ) " " " " "*"
## 2 ( 1 ) " " " " " "
## 2 ( 2 ) " " "*" " "
## 3 ( 1 ) " " " " " "
## 3 ( 2 ) " " "*" " "
## 4 ( 1 ) " " "*" " "
## 4 ( 2 ) " " " " " "
## 5 ( 1 ) " " "*" " "
## 5 ( 2 ) " " "*" " "
## 6 ( 1 ) " " "*" " "
## 6 ( 2 ) "*" "*" " "
## 7 ( 1 ) "*" "*" " "
## 7 ( 2 ) " " "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 8 ( 2 ) "*" "*" " "
## 9 ( 1 ) "*" "*" "*"
## 9 ( 2 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 10 ( 2 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 11 ( 2 ) "*" "*" "*"
## bank_asset_value loan_statusRejected
## 1 ( 1 ) " " " "
## 1 ( 2 ) " " " "
## 2 ( 1 ) " " "*"
## 2 ( 2 ) " " " "
## 3 ( 1 ) " " "*"
## 3 ( 2 ) " " "*"
## 4 ( 1 ) " " "*"
## 4 ( 2 ) " " "*"
## 5 ( 1 ) " " "*"
## 5 ( 2 ) " " "*"
## 6 ( 1 ) " " "*"
## 6 ( 2 ) " " "*"
## 7 ( 1 ) " " "*"
## 7 ( 2 ) " " "*"
## 8 ( 1 ) " " "*"
## 8 ( 2 ) "*" "*"
## 9 ( 1 ) "*" "*"
## 9 ( 2 ) " " "*"
## 10 ( 1 ) "*" "*"
## 10 ( 2 ) "*" "*"
## 11 ( 1 ) "*" "*"
## 11 ( 2 ) "*" "*"
The regsubsets() function identifies the optimal subset of predictor variables by maximizing or minimizing a selected criterion: R-squared (r2) and Adjusted R-squared (adjr2) are maximized, while the Bayesian Information Criterion (BIC) and Mallows' Cp are minimized.
Based on the Adjusted R-squared plot, the best set of predictors is: no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score and loan_status. The R-squared plot yields the same set. From the BIC plot, the best set of predictors is: no_of_dependents, loan_term, income_annum, residential_assets_value, commercial_assets_value, cibil_score and loan_status. From the Mallows' Cp plot, the best set of predictors is: no_of_dependents, income_annum, residential_assets_value, commercial_assets_value, cibil_score and loan_status.
## Abbreviation
## no_of_dependents n
## educationNot Graduate eG
## self_employedNo s_N
## income_annum i
## loan_term ln_
## cibil_score cb_
## residential_assets_value r
## commercial_assets_value c__
## luxury_assets_value l__
## bank_asset_value b
## loan_statusRejected l_R
## self_employedOther s_O
From the Adjusted R-squared statistic plot, the most suitable set of predictors is found to be: no_of_dependents, loan_term, income_annum, commercial_assets_value, cibil_score, luxury_assets_value and loan_status.
## Abbreviation
## no_of_dependents n
## educationNot Graduate eG
## self_employedNo s_N
## income_annum i
## loan_term ln_
## cibil_score cb_
## residential_assets_value r
## commercial_assets_value c__
## luxury_assets_value l__
## bank_asset_value b
## loan_statusRejected l_R
## self_employedOther s_O
The most relevant predictors from the Mallows' Cp plot are found to be no_of_dependents, income_annum, commercial_assets_value, cibil_score and loan_status.
Classification Problem
## Fitting algorithm: AIC-glm
## Best Model:
## df deviance
## Null Model 4262 1880
## Full Model 4268 5661
##
## likelihood-ratio test - GLM
##
## data: H0: Null Model vs. H1: Best Fit AIC-glm
## X = 3780, df = 6, p-value <2e-16
## no_of_dependents education self_employed income_annum loan_amount loan_term
## 1 FALSE FALSE FALSE TRUE TRUE TRUE
## 2 FALSE FALSE FALSE TRUE TRUE TRUE
## 3 FALSE FALSE FALSE TRUE TRUE TRUE
## 4 FALSE TRUE FALSE TRUE TRUE TRUE
## 5 FALSE FALSE FALSE TRUE TRUE TRUE
## cibil_score residential_assets_value commercial_assets_value
## 1 TRUE FALSE FALSE
## 2 TRUE FALSE FALSE
## 3 TRUE FALSE TRUE
## 4 TRUE FALSE FALSE
## 5 TRUE FALSE TRUE
## luxury_assets_value bank_asset_value Criterion
## 1 TRUE TRUE 1892
## 2 TRUE FALSE 1893
## 3 TRUE TRUE 1893
## 4 TRUE TRUE 1894
## 5 TRUE FALSE 1894
## no_of_dependents education self_employed income_annum loan_amount
## Mode :logical Mode :logical Mode :logical Mode:logical Mode:logical
## FALSE:5 FALSE:4 FALSE:5 TRUE:5 TRUE:5
## TRUE :1
##
##
##
## loan_term cibil_score residential_assets_value commercial_assets_value
## Mode:logical Mode:logical Mode :logical Mode :logical
## TRUE:5 TRUE:5 FALSE:5 FALSE:3
## TRUE :2
##
##
##
## luxury_assets_value bank_asset_value Criterion
## Mode:logical Mode :logical Min. :1892
## TRUE:5 FALSE:2 1st Qu.:1893
## TRUE :3 Median :1893
## Mean :1893
## 3rd Qu.:1894
## Max. :1894
Model Creation
In our analysis, we utilize a dual approach, employing regression for predicting loan amount and classification for determining loan status. This enhances our model’s predictive capabilities in addressing different facets of the loan application process.
Train Test Split
The dataset was initially examined for the distribution of the target variable, loan_status (indicating loan approval). To promote model generalization, the data was later divided into training (80%) and test (20%) sets, ensuring reproducibility with a specified seed.
##
## Approved Rejected
## 2656 1613
## Rejected
## 0.378
## [1] 0.8
## [1] 3415
## [1] 854
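An 80/20 split with a fixed seed can be sketched as follows. The seed value, the placeholder data frame, and the use of `sample()` are assumptions; only the row counts (4269 total, 3415 training, 854 test) and the names `data_train` and `data_test` come from the output shown in this report.

```r
set.seed(123)  # hypothetical seed; any fixed value makes the split reproducible

n <- 4269
# Placeholder data frame standing in for the cleaned loan data
loans <- data.frame(
  id = seq_len(n),
  loan_status = factor(sample(c("Approved", "Rejected"), n,
                              replace = TRUE, prob = c(0.62, 0.38)))
)

train_fraction <- 0.8
train_idx <- sample(seq_len(n), size = floor(train_fraction * n))

data_train <- loans[train_idx, ]   # rows drawn for training
data_test  <- loans[-train_idx, ]  # all remaining rows

nrow(data_train)
nrow(data_test)
prop.table(table(data_train$loan_status))  # class balance in the training set
```

A purely random split like this one does not guarantee identical class proportions in both sets; a stratified split would, at the cost of slightly more code.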
Regression
Linear Regression
A linear regression model was constructed using the lm()
function, predicting loan_amount based on various features, including
no_of_dependents, loan_term,
income_annum, commercial_assets_value,
cibil_score, and loan_status.
##
## Call:
## lm(formula = loan_amount ~ no_of_dependents + loan_term + income_annum +
## commercial_assets_value + cibil_score + loan_status, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9773675 -1894729 -32406 2008070 10051244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.02e+06 3.60e+05 5.62 2.0e-08 ***
## no_of_dependents -4.88e+04 3.03e+04 -1.61 0.108
## loan_term 9.14e+03 9.17e+03 1.00 0.319
## income_annum 2.96e+00 2.39e-02 123.95 < 2e-16 ***
## commercial_assets_value 3.02e-02 1.53e-02 1.98 0.048 *
## cibil_score -2.51e+03 4.73e+02 -5.30 1.2e-07 ***
## loan_statusRejected -1.26e+06 1.69e+05 -7.41 1.5e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3360000 on 4262 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.862
## F-statistic: 4.45e+03 on 6 and 4262 DF, p-value: <2e-16
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 2.02e+06 | 3.60e+05 | 5.620 | 0.0000 |
| no_of_dependents | -4.88e+04 | 3.03e+04 | -1.610 | 0.1075 |
| loan_term | 9.14e+03 | 9.17e+03 | 0.997 | 0.3190 |
| income_annum | 2.96e+00 | 2.39e-02 | 123.951 | 0.0000 |
| commercial_assets_value | 3.02e-02 | 1.53e-02 | 1.977 | 0.0481 |
| cibil_score | -2.51e+03 | 4.73e+02 | -5.303 | 0.0000 |
| loan_statusRejected | -1.26e+06 | 1.69e+05 | -7.413 | 0.0000 |
| cibil_score | commercial_assets_value | income_annum | loan_statusRejected | loan_term | no_of_dependents |
|---|---|---|---|---|---|
| 2.52 | 1.7 | 1.7 | 2.55 | 1.04 | 1 |
The summary statistics and variance inflation factors (VIF) were analyzed for insights. All VIF values are below 3, which indicates no problematic multicollinearity among the features.
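For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The base-R sketch below computes it by hand on simulated data; the helper `vif_manual()` and the toy variables are hypothetical, and the report itself may have used a package function instead.

```r
# Hypothetical helper: VIF for each numeric predictor, computed from first principles
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(7)
x1 <- rnorm(300)
x2 <- 0.6 * x1 + rnorm(300)  # moderately correlated with x1
x3 <- rnorm(300)             # unrelated to the others
toy <- data.frame(x1, x2, x3)

vifs <- vif_manual(toy, c("x1", "x2", "x3"))
round(vifs, 2)
```

A VIF of 1 means a predictor is uncorrelated with the rest; commonly quoted warning thresholds are 5 or 10.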
Results
## Training R-squared: 0.862
## Testing R-squared: 0.862
The scatter plot displays the actual versus predicted loan amounts, with a dashed red line denoting the ideal prediction scenario. The R-squared value of 0.862 on both the training and testing sets underscores the model's strong explanatory power and its generalization to new data.
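Training and testing R-squared can be computed from predictions as in the sketch below. The helper `r_squared()` and the simulated income and loan values are assumptions for illustration, not the project's code or data.

```r
# Hypothetical helper: coefficient of determination from actual and predicted values
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

set.seed(11)
income <- runif(400, min = 2e5, max = 9.9e6)
loan   <- 3 * income + rnorm(400, sd = 3e6)  # simulated relationship
toy    <- data.frame(income, loan)

train <- toy[1:320, ]
test  <- toy[321:400, ]

fit <- lm(loan ~ income, data = train)

train_r2 <- r_squared(train$loan, predict(fit, newdata = train))
test_r2  <- r_squared(test$loan,  predict(fit, newdata = test))

c(train = train_r2, test = test_r2)
```

On the training data this matches the Multiple R-squared reported by `summary(fit)`; on held-out data it can be lower, and a large gap between the two would suggest overfitting.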
Classification
Logistic Regression
This study uses logistic regression to create a predictive model for loan status, leveraging the glm() function. Key features, including dependents, annual income, loan amount, term, CIBIL score, luxury assets value, and bank assets value, are highlighted.
Model building
##
## Call:
## glm(formula = loan_status ~ no_of_dependents + income_annum +
## loan_amount + loan_term + cibil_score + luxury_assets_value +
## bank_asset_value, family = "binomial", data = data_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.13e+01 4.76e-01 23.67 < 2e-16 ***
## no_of_dependents 6.60e-03 3.85e-02 0.17 0.86
## income_annum 5.26e-07 9.54e-08 5.51 3.5e-08 ***
## loan_amount -1.33e-07 1.99e-08 -6.72 1.9e-11 ***
## loan_term 1.49e-01 1.27e-02 11.72 < 2e-16 ***
## cibil_score -2.46e-02 9.24e-04 -26.59 < 2e-16 ***
## luxury_assets_value -2.66e-08 1.97e-08 -1.35 0.18
## bank_asset_value -3.98e-08 3.63e-08 -1.10 0.27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4528.9 on 3414 degrees of freedom
## Residual deviance: 1531.7 on 3407 degrees of freedom
## AIC: 1548
##
## Number of Fisher Scoring iterations: 7
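The coefficients table reports the estimated effect of each predictor on the log-odds of the modeled outcome. Because glm() treats the first factor level (Approved) as the reference, the modeled outcome is Rejected, so a negative coefficient lowers the odds of rejection and therefore favors approval. Key findings include:
- The intercept is large and positive (11.27).
- loan_term and cibil_score have the largest z-values in magnitude and very low p-values, so they significantly impact loan status.
- income_annum and loan_amount are also statistically significant; their coefficients look tiny only because both variables are measured in single currency units.
- no_of_dependents, luxury_assets_value, and bank_asset_value are not statistically significant (p > 0.05).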
Feature Importance
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 11.2729 | 0.4762 | 23.671 | 0.000 |
| no_of_dependents | 0.0066 | 0.0385 | 0.171 | 0.864 |
| income_annum | 0.0000 | 0.0000 | 5.512 | 0.000 |
| loan_amount | 0.0000 | 0.0000 | -6.715 | 0.000 |
| loan_term | 0.1491 | 0.0127 | 11.718 | 0.000 |
| cibil_score | -0.0246 | 0.0009 | -26.587 | 0.000 |
| luxury_assets_value | 0.0000 | 0.0000 | -1.353 | 0.176 |
| bank_asset_value | 0.0000 | 0.0000 | -1.097 | 0.273 |
| | Odds ratio |
|---|---|
| (Intercept) | 7.87e+04 |
| no_of_dependents | 1.01e+00 |
| income_annum | 1.00e+00 |
| loan_amount | 1.00e+00 |
| loan_term | 1.16e+00 |
| cibil_score | 9.76e-01 |
| luxury_assets_value | 1.00e+00 |
| bank_asset_value | 1.00e+00 |
Feature importance summary:
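- The odds ratios above are the exponentiated coefficients. Since Approved is the reference level of loan_status, they describe the odds of rejection.
- The intercept is notably high, serving as the baseline odds when all predictors are zero.
- Number of dependents has minimal impact (non-significant).
- Annual income and loan amount have odds ratios of about 1.00 per currency unit. Both are statistically significant, but the effect of a single-unit change is negligible.
- Loan term has a substantial effect: each additional year raises the odds of rejection by about 16% (odds ratio 1.16).
- Higher CIBIL scores correspond to lower odds of rejection (odds ratio 0.976 per point), that is, to higher odds of approval.
- Luxury assets and bank assets show minimal impact (non-significant).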
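The conversion from a logistic regression coefficient to an odds ratio is a plain exponentiation. The sketch below applies it to two coefficients copied from the table above; the vector name `coefs` is an assumption, and the odds refer to the outcome that glm() models (the second factor level).

```r
# Coefficients taken from the logistic regression table in this report
coefs <- c(loan_term = 0.1491, cibil_score = -0.0246)

# Odds ratio: multiplicative change in the odds per one-unit increase
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# Same information as a percentage change in the odds
pct_change <- (odds_ratios - 1) * 100
round(pct_change, 1)
```

With a fitted model object, the same numbers would typically come from `exp(coef(model))`.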
Data imbalance
- The data imbalance graph shows an uneven split between the classes: 2656 applications (62.2%) were approved and 1613 (37.8%) were rejected. Sole reliance on accuracy and ROC scores may therefore be insufficient for our predictive analysis.
Train and test metrics
## [1] "Training Accuracy: 91.45 %"
## [1] "Test Accuracy: 93.09 %"
## [1] "Training Precision: 88.92 %"
## [1] "Training Recall: 88.51 %"
## [1] "Test Precision: 90.68 %"
## [1] "Test Recall: 90.97 %"
In the logistic regression model, the following performance metrics were observed:
- Training Accuracy: 91.45%
- Test Accuracy: 93.09%
- Training Precision: 88.92%
- Training Recall: 88.51%
- Test Precision: 90.68%
- Test Recall: 90.97%
These metrics reveal the model’s proficiency in predicting loan approval status with high accuracy and precision. Balanced recall scores suggest effective capture of both approved and rejected instances, showcasing the logistic regression model’s robust performance on training and test sets.
Confusion matrix
| | Predicted Approved | Predicted Rejected | Total |
|---|---|---|---|
| Actual Approved | 1979 | 145 | 2124 |
| Actual Rejected | 150 | 1141 | 1291 |
| Total | 2129 | 1286 | 3415 |
This confusion matrix details the model's predictions on the training set (3415 observations). Treating Approved as the positive class, it contains True Positives (1979), False Negatives (145), False Positives (150), and True Negatives (1141). These counts are used to calculate evaluation measures such as precision, recall, and accuracy.
Receiver-Operator-Characteristic (ROC) curve and Area-Under-Curve (AUC)
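The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) across classification thresholds, and the AUC summarizes it in a single number. The AUC ranges from 0 to 1, where 0.5 corresponds to random guessing. Values higher than 0.8 are generally considered a good model fit.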
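The AUC can be computed without plotting, using its rank-based (Mann-Whitney) form: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The helper `auc_rank()` and the six toy scores below are assumptions for illustration.

```r
# Hypothetical helper: AUC from scores and logical labels (TRUE = positive class)
auc_rank <- function(scores, labels) {
  r <- rank(scores)  # ties receive average ranks
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  # Rank-sum of the positives, shifted and scaled to the 0..1 range
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_rank(scores, labels)  # 8 of the 9 positive/negative pairs are ordered correctly
```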
McFadden pseudo R-squared
## 'log Lik.' 0.729 (df=8)
McFadden's pseudo R-squared for the logistic regression model is 0.729. The output above carries the 'log Lik.' label and 8 degrees of freedom, presumably because the statistic was derived from log-likelihood objects. It measures the model's fit relative to a basic intercept-only model, and a higher value typically indicates a better fit.
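McFadden's pseudo R-squared is defined as 1 - logLik(fitted model) / logLik(intercept-only model). The sketch below computes it on simulated data; the variable names and the simulated relationship are assumptions, not the project's model.

```r
set.seed(3)

# Simulated credit scores and a binary outcome that depends on them
score    <- rnorm(500, mean = 600, sd = 150)
rejected <- rbinom(500, size = 1, prob = plogis((550 - score) / 40))

fit  <- glm(rejected ~ score, family = binomial)  # model of interest
null <- glm(rejected ~ 1,     family = binomial)  # intercept-only baseline

# as.numeric() drops the 'log Lik.' class so a plain number is returned
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden
```

Omitting `as.numeric()` leaves the result tagged as a log-likelihood object, which is one way a pseudo R-squared can end up printed with a 'log Lik.' label.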
Data preprocessing
Factoring the data
## 'data.frame': 3415 obs. of 14 variables:
## $ no_of_dependents : int 3 3 2 3 5 4 3 3 3 5 ...
## $ education : Factor w/ 2 levels "Graduate","Not Graduate": 2 1 1 1 1 1 2 1 2 2 ...
## $ self_employed : num 2 1 2 2 1 2 2 2 1 1 ...
## $ income_annum : int 1600000 6700000 4700000 8000000 4200000 8700000 3800000 5500000 6200000 2700000 ...
## $ loan_amount : int 3900000 13400000 14000000 26200000 9400000 32100000 11400000 11600000 14600000 6800000 ...
## $ loan_term : int 20 14 12 16 6 8 4 6 2 6 ...
## $ cibil_score : int 804 782 784 890 678 397 323 745 664 461 ...
## $ residential_assets_value: int 900000 4900000 13400000 15800000 4700000 8900000 5800000 9400000 18300000 4500000 ...
## $ commercial_assets_value : int 2500000 4000000 2700000 4300000 5500000 17000000 4000000 8000000 8400000 2000000 ...
## $ luxury_assets_value : int 3200000 25800000 14800000 25000000 9900000 34700000 9600000 12800000 16800000 8200000 ...
## $ bank_asset_value : int 800000 9600000 6900000 4000000 5700000 4800000 4400000 5400000 6600000 1700000 ...
## $ loan_status : Factor w/ 2 levels "Approved","Rejected": 1 1 1 1 1 2 1 1 1 2 ...
## $ prediction : num 0.00512 0.0057 0.00193 0.00025 0.01819 ...
## $ prob : num 0.00512 0.0057 0.00193 0.00025 0.01819 ...
KNN Model creation
Finding the best K value
## num [1:2, 1:11] 1 0.542 3 0.573 5 ...
The analysis above shows how the number of neighbors (k) affects the accuracy of loan status predictions. According to the accuracy vs. k chart, k=18 yields the highest accuracy for this kNN model.
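A search over k can be sketched as follows. This is an illustrative example on synthetic data, not the project’s code: the helper `knn_predict`, the grid of odd k values from 1 to 21 (suggested only by the 11 columns in the printed matrix), and the feature scaling step are all assumptions.

```r
# Hypothetical helper: majority-vote kNN with Euclidean distance, base R only.
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # distance to every training row
    nearest <- order(d)[seq_len(k)]
    names(which.max(table(train_y[nearest])))  # most common label among neighbors
  })
}

# Synthetic data (assumption): the label depends mostly on one feature.
set.seed(1)
n <- 300
x <- cbind(score = rnorm(n), income = rnorm(n))
y <- factor(ifelse(x[, "score"] + 0.3 * rnorm(n) > 0, "Approved", "Rejected"))
x <- scale(x)                                  # put features on a common scale

train_idx <- sample(n, 200)
k_grid <- seq(1, 21, by = 2)                   # assumed grid of candidate k values

accuracy <- sapply(k_grid, function(k) {
  pred <- knn_predict(x[train_idx, ], y[train_idx], x[-train_idx, ], k)
  mean(pred == y[-train_idx])                  # hold-out accuracy for this k
})

best_k <- k_grid[which.max(accuracy)]
rbind(k = k_grid, accuracy = round(accuracy, 3))
```

Because kNN relies on distances, features measured on very different scales (such as income and loan term) can dominate the result, which is why the sketch scales the inputs first.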
KNN Evaluation metrics
For training
## Actual
## Predicted Approved Rejected
## Approved 1855 980
## Rejected 269 311
## [1] "Training Accuracy: 63.43 %"
The confusion matrix for the training data shows that, out of 3,415 instances, the model correctly predicted 1,855 approved and 311 rejected cases, giving a training accuracy of 63.43%.
For testing
## Factor w/ 2 levels "Approved","Rejected": 1 1 1 1 1 2 1 1 2 2 ...
## [1] 854
## bank_18NN
## Approved Rejected
## 704 150
## [1] "Test Accuracy: 61.12 %"
On the test data, the kNN model with k=18 correctly predicts loan approval status for approximately 61.12% of the 854 applications.
Confusion matrix
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 854
##
##
## | bank_18NN
## data_test[, "loan_status"] | Approved | Rejected | Row Total |
## ---------------------------|-----------|-----------|-----------|
## Approved | 452 | 80 | 532 |
## | 0.850 | 0.150 | 0.623 |
## | 0.642 | 0.533 | |
## | 0.529 | 0.094 | |
## ---------------------------|-----------|-----------|-----------|
## Rejected | 252 | 70 | 322 |
## | 0.783 | 0.217 | 0.377 |
## | 0.358 | 0.467 | |
## | 0.295 | 0.082 | |
## ---------------------------|-----------|-----------|-----------|
## Column Total | 704 | 150 | 854 |
## | 0.824 | 0.176 | |
## ---------------------------|-----------|-----------|-----------|
##
##
The k-Nearest Neighbor (kNN) model achieved around 61.12% accuracy on the test dataset. Among 854 instances, it correctly predicted 452 of the 532 actual “Approved” loans (a recall of 85.0%), with a precision of 64.2% (452 of 704 predicted approvals). It correctly classified only 70 of the 322 actual “Rejected” loans (21.7%), and 46.7% of its 150 predicted rejections were correct.
Testing results
## Confusion Matrix and Statistics
##
## Reference
## Prediction Approved Rejected
## Approved 452 252
## Rejected 80 70
##
## Accuracy : 0.611
## 95% CI : (0.578, 0.644)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : 0.771
##
## Kappa : 0.075
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.850
## Specificity : 0.217
## Pos Pred Value : 0.642
## Neg Pred Value : 0.467
## Prevalence : 0.623
## Detection Rate : 0.529
## Detection Prevalence : 0.824
## Balanced Accuracy : 0.534
##
## 'Positive' Class : Approved
##
## [1] "Precision: 0.64"
## [1] "Recall: 0.85"
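The k-Nearest Neighbor model identifies “Approved” loans with a precision of 64% and a recall of 85%, but its specificity is only 21.7%. Its accuracy (61.1%) falls slightly below the no-information rate (62.3%), and the Kappa of 0.075 indicates only slight agreement beyond chance.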
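The summary statistics in the output above can be reproduced from the four cell counts. The sketch below uses base R only; the variable names are our own, and the formulas are the standard definitions of the no-information rate, Cohen’s Kappa, and balanced accuracy.

```r
# Test-set confusion matrix from the output above (rows = predicted, columns = actual).
cm <- matrix(c(452, 80, 252, 70), nrow = 2,
             dimnames = list(Predicted = c("Approved", "Rejected"),
                             Actual    = c("Approved", "Rejected")))

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n
nir      <- max(colSums(cm)) / n                 # always predicting the majority class

# Cohen's Kappa: observed agreement corrected for agreement expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

sensitivity <- cm["Approved", "Approved"] / sum(cm[, "Approved"])
specificity <- cm["Rejected", "Rejected"] / sum(cm[, "Rejected"])
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, nir = nir, kappa = cohen_kappa,
        balanced_accuracy = balanced_accuracy), 3)
```

The results match the reported values: accuracy 0.611, no-information rate 0.623, Kappa 0.075, and balanced accuracy 0.534.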
AUC ROC curve for KNN
With an AUC of 0.534, barely above the chance level of 0.5, and test/train accuracies of 61.12%/63.43%, the model’s performance is subpar. Recognizing this, we explore alternative non-parametric models for better predictive performance.
Decision Tree model
Training
## [1] "Train Accuracy: 96.81 %"
Testing
## [1] "Test Accuracy: 96.49 %"
Testing results
## Confusion Matrix and Statistics
##
## Reference
## Prediction Approved Rejected
## Approved 529 27
## Rejected 3 295
##
## Accuracy : 0.965
## 95% CI : (0.95, 0.976)
## No Information Rate : 0.623
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.924
##
## Mcnemar's Test P-Value : 2.68e-05
##
## Sensitivity : 0.994
## Specificity : 0.916
## Pos Pred Value : 0.951
## Neg Pred Value : 0.990
## Prevalence : 0.623
## Detection Rate : 0.619
## Detection Prevalence : 0.651
## Balanced Accuracy : 0.955
##
## 'Positive' Class : Approved
##
## [1] "Precision: 0.95"
## [1] "Recall: 0.99"
Based on the model results, we obtain a precision and recall score of 95% and 99% respectively.
AUC ROC curve of decision trees
## AUC: 0.955
We obtain the above AUC-ROC curve with an area under the curve value of ~95.5%, indicating that this model is a good fit for our data.
AUC Scores
## [1] "AUC score of KNN 0.533507682249101"
## [1] "AUC score of decision Tress 0.95525498528931"
## [1] "AUC score of Logistic regressor 0.96766802184032"
So far, of the three models we have implemented, Logistic Regression is the best performer with an AUC score of ~96.8%.
Random forest
##
## Call:
## randomForest(formula = loan_status ~ no_of_dependents + income_annum + loan_amount + loan_term + cibil_score + luxury_assets_value + bank_asset_value, data = data_train, ntree = 500, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.58%
## Confusion matrix:
## Approved Rejected class.error
## Approved 2108 16 0.00753
## Rejected 38 1253 0.02943
- The confusion matrix indicates strong performance in predicting both “Approved” and “Rejected” classes.
- Class error rates are low, with 0.75% for “Approved” and 2.94% for “Rejected,” showcasing accurate predictions.
- The model’s ability to correctly identify instances is evident from the high numbers on the diagonal of the confusion matrix.
Overall, the random forest model proves effective in classifying loan applications based on provided features, demonstrating a low out-of-bag error rate.
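The class error rates and the out-of-bag (OOB) error rate follow directly from the OOB confusion matrix printed above. A minimal sketch in base R, with variable names of our own choosing:

```r
# OOB confusion matrix from the output above (rows = actual, columns = predicted).
oob <- matrix(c(2108, 38, 16, 1253), nrow = 2,
              dimnames = list(Actual    = c("Approved", "Rejected"),
                              Predicted = c("Approved", "Rejected")))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all training cases that were misclassified.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 5)
round(100 * oob_error, 2)
```

This gives class errors of about 0.00753 and 0.02943 and an overall OOB error of about 1.58%, in line with the printed summary.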
Feature Importance Summary
## Approved Rejected MeanDecreaseAccuracy MeanDecreaseGini
## no_of_dependents 1.33 4.34 3.74 15.8
## income_annum 15.75 14.63 21.43 37.2
## loan_amount 26.23 17.25 32.05 58.8
## loan_term 87.47 80.78 103.77 95.8
## cibil_score 383.23 399.68 440.78 1328.5
## luxury_assets_value 12.77 10.13 16.37 38.7
## bank_asset_value 10.79 9.59 14.86 29.8
- CIBIL Score (cibil_score):
- Highest importance by a wide margin in both accuracy and Gini impurity reduction.
- Loan Term (loan_term):
- Second most important on both metrics.
- Loan Amount (loan_amount):
- Third on both metrics, though well below the top two.
- Income (income_annum) and Luxury Assets (luxury_assets_value):
- Moderately important.
- Bank Asset Value (bank_asset_value):
- Relatively lower importance.
- Number of Dependents (no_of_dependents):
- Least impactful on model performance.
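The ranking can be read off the importance table by sorting it. The sketch below re-enters the two summary columns from the table above into a data frame by hand; in the project itself these values would come from the fitted random forest object rather than being typed in.

```r
# Values copied from the feature importance table above.
importance <- data.frame(
  feature = c("no_of_dependents", "income_annum", "loan_amount", "loan_term",
              "cibil_score", "luxury_assets_value", "bank_asset_value"),
  MeanDecreaseAccuracy = c(3.74, 21.43, 32.05, 103.77, 440.78, 16.37, 14.86),
  MeanDecreaseGini     = c(15.8, 37.2, 58.8, 95.8, 1328.5, 38.7, 29.8)
)

# Sort from most to least important by mean decrease in accuracy.
ranked <- importance[order(-importance$MeanDecreaseAccuracy), ]
ranked$feature
```

Sorted this way, `cibil_score` comes first, followed by `loan_term` and `loan_amount`, with `no_of_dependents` last.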
Model metrics random forest
Training and testing accuracy
## [1] "Train Accuracy: 100 %"
## [1] "Test Accuracy: 98.59 %"
## [1] "Precision: 0.97"
## [1] "Recall: 0.99"
- The model attained 100% accuracy on the training set and sustained a high accuracy of 98.59% on the test data, indicating strong generalization to new instances.
- The scores show high precision (0.97) in correctly identifying positives and strong recall (0.99) in capturing actual positives, so the model performs well on both measures for loan application classification.
AUC ROC Curve
## AUC: 0.983
The random forest model achieved an Area Under the Curve (AUC) score of 0.983 (98.3%).
An AUC score nearing 1 signifies strong discriminatory power, indicating the model excels in distinguishing between classes. The high AUC underscores the random forest model’s robustness in classifying loan applications.
AUC Scores
## [1] "AUC score of KNN 0.533507682249101"
## [1] "AUC score of decision Tress 0.95525498528931"
## [1] "AUC score of Logistic regressor 0.96766802184032"
## [1] "AUC score of Random Forest 0.983205295848316"
Model Result chart
Random Forest: Achieving the highest AUC score of 98.321%, the Random Forest model demonstrates superior discriminative capability. This underscores its suitability for the given classification task.
Logistic Regression: With an AUC score of 96.767%, Logistic Regression performs admirably and proves to be a robust model for the task.
Decision Tree: The Decision Tree model yields a respectable AUC score of 95.525%, positioning it as a viable choice for classification.
KNN: KNN lags far behind the other models with an AUC score of 53.351%, only marginally better than random guessing (50%), so in its current form it is not a competitive choice for this task.
Conclusion
In summary, our project examines a Kaggle dataset to understand what drives individual loan approval. Drawing from our experiences as international students, we provide insights into the expanding lending landscape. Our results underscore the pivotal role of the CIBIL score and compare several prediction models, of which Random Forest achieved the highest AUC. We advise applicants to prioritize their credit scores and align preferences with financial goals. Simultaneously, lending institutions can bolster decision-making efficiency. Overall, our work enhances the comprehension of loan dynamics, offering advantages to lenders and borrowers globally.
References
[1] Sheikh, M. A., Goel, A. K., & Kumar, T. (2020, July). An approach for prediction of loan approval using machine learning algorithm. In 2020 International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 490-494). IEEE.
[2] Ndayisenga, T. (2021). Bank loan approval prediction using machine learning techniques (Doctoral dissertation).
[3] Murthy, P. S., Shekar, G. S., Rohith, P., & Reddy, G. V. V. (2020). Loan Approval Prediction System Using Machine Learning. Journal of Innovation in Information Technology, 4(1), 21-24.